{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "import re"
   ]
  },
  {
   "source": [
    "# Ch07 从文本提取信息\n",
    "\n",
    "学习目标\n",
    "\n",
    "1.  从非结构化文本中提取结构化数据\n",
    "2.  识别一个文本中描述的实体和关系\n",
    "3.  使用语料库来训练和评估模型"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 7.1 信息提取\n",
    "\n",
    "从文本中获取意义的方法被称为「信息提取」\n",
    "\n",
    "1.  从结构化数据中提取信息\n",
    "2.  从非结构化文本中提取信息\n",
    "    -   建立一个非常一般的含义\n",
    "    -   查找文本中具体的各种信息\n",
    "        -   将非结构化数据转换成结构化数据\n",
    "        -   使用查询工具从文本中提取信息"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 7.1.1 信息提取结构\n",
    "｛原始文本（一串）｝→ 断句 \n",
    "\n",
    "→｛句子（字符串列表）｝→ 分词 \n",
    "\n",
    "→｛句子分词｝→ 词性标注 \n",
    "\n",
    "→｛句子词性标注｝→ 实体识别 \n",
    "\n",
    "→｛句子分块｝→ 关系识别 \n",
    "\n",
    "→｛关系列表｝"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['BBDO South', 'Georgia-Pacific']\n"
     ]
    }
   ],
   "source": [
    "# P283 图7-1，Ex7-1：信息提取结构元组（entity, relation, entity)\n",
    "# 建立标准化数据库\n",
    "locs = [('Omnicom', 'IN', 'New York'),\n",
    "        ('DDB Needham', 'IN', 'New York'),\n",
    "        ('Kaplan Thaler Group', 'IN', 'New York'),\n",
    "        ('BBDO South', 'IN', 'Atlanta'),\n",
    "        ('Georgia-Pacific', 'IN', 'Atlanta')]\n",
    "# 依据查询提取信息\n",
    "query = [\n",
    "        e1\n",
    "        for (e1, rel, e2) in locs\n",
    "        if e2 == 'Atlanta'\n",
    "]\n",
    "print(query)"
   ]
  },
  {
   "source": [
    "## 7.2 分块：用于实体识别的基本技术（P284 图7-2）\n",
    "\n",
    "分块构成的源文本中的片段不能重叠\n",
    "\n",
    "-   小框显示词级标识符和词性标注\n",
    "-   大框表示组块（chunk），是较高级别的程序分块\n",
    "\n",
    "分块的方法\n",
    "\n",
    "-  正则表达式和N-gram方法分块；\n",
    "-  使用CoNLL-2000分块语料库开发和评估分块器；"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 7.2.1 名词短语分块（NP-chunking，即“NP-分块”）寻找单独名词短语对应的块\n",
    "\n",
    "NP-分块是比完整的名词短语更小的片段，不包含其他的NP-分块，修饰一个任何介词短语或者从句将不包括在相应的NP-分块内。\n",
    " \n",
    "NP-分块信息最有用的来源之一是词性标记。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP the/DT little/JJ yellow/JJ dog/NN)\n  barked/VBD\n  at/IN\n  (NP the/DT cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# P285 Ex7-1 基于正则表达式的NP 分块器\n",
    "# 使用分析器对句子进行分块\n",
    "sentence = [('the', 'DT'),\n",
    "            ('little', 'JJ'),\n",
    "            ('yellow', 'JJ'),\n",
    "            ('dog', 'NN'),\n",
    "            ('barked', 'VBD'),\n",
    "            ('at', 'IN'),\n",
    "            ('the', 'DT'),\n",
    "            ('cat', 'NN')]\n",
    "# 定义分块语法\n",
    "grammar = 'NP: {<DT>?<JJ>*<NN>}'\n",
    "# 创建组块分析器\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "# 对句子进行分块\n",
    "result = cp.parse(sentence)\n",
    "# 输出分块的树状图\n",
    "print(result)\n",
    "result.draw()"
   ]
  },
  {
   "source": [
    "### 7.2.2. 标记模式"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP another/DT sharp/JJ dive/NN trade/NN figures/NNS)\n  (NP any/DT new/JJ policy/NN measures/NNS)\n  (NP earlier/JJR stages/NNS)\n  (NP Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP))\n"
     ]
    }
   ],
   "source": [
    "# 华尔街日报\n",
    "sentence = [('another', 'DT'),\n",
    "            ('sharp', 'JJ'),\n",
    "            ('dive', 'NN'),\n",
    "            ('trade', 'NN'),\n",
    "            ('figures', 'NNS'),\n",
    "            ('any', 'DT'),\n",
    "            ('new', 'JJ'),\n",
    "            ('policy', 'NN'),\n",
    "            ('measures', 'NNS'),\n",
    "            ('earlier', 'JJR'),\n",
    "            ('stages', 'NNS'),\n",
    "            ('Panamanian', 'JJ'),\n",
    "            ('dictator', 'NN'),\n",
    "            ('Manuel', 'NNP'),\n",
    "            ('Noriega', 'NNP')]\n",
    "grammar = 'NP: {<DT>?<JJ.*>*<NN.*>+}'\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "result = cp.parse(sentence)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Grammar中输入语法，语法格式{<DT>?<JJ.*>*<NN.*>+}，不能在前面加NP:，具体可以参考右边的Regexps说明\n",
    "# Development Set就是开发测试集，用于调试语法规则。绿色表示正确匹配，红色表示没有正确匹配。黄金标准标注为下划线\n",
    "nltk.app.chunkparser()"
   ]
  },
  {
   "source": [
    "### 7.2.3 用正则表达式分块（组块分析）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "# Ex7-2 简单的名词短语分类器\n",
    "sentence = [(\"Rapunzel\", \"NNP\"),\n",
    "            (\"let\", \"VBD\"),\n",
    "            (\"down\", \"RP\"),\n",
    "            (\"her\", \"PP$\"),\n",
    "            (\"long\", \"JJ\"),\n",
    "            (\"golden\", \"JJ\"),\n",
    "            (\"hair\", \"NN\")]\n",
    "# 两个规则组成的组块分析语法，注意规则执行会有先后顺序，两个规则如果有重叠部分，以先执行的为准\n",
    "grammar = r'''\n",
    "  NP: {<DT|PP\\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun\n",
    "      {<NNP>+}                # chunk sequences of proper nouns\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  Rapunzel/NNP\n  let/VBD\n  down/RP\n  her/PP$\n  (NP long/JJ golden/JJ)\n  hair/NN)\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJ].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJNP].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  her/PP$\n  (NP long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJN].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "错误分块的结果=  (S (NP money/NN market/NN) fund/NN)\n正确分块的结果=  (S (NP money/NN market/NN fund/NN))\n"
     ]
    }
   ],
   "source": [
    "# 如果模式匹配位置重叠，最左边的优先匹配。\n",
    "# 例如：如果将匹配两个连贯名字的文本的规则应用到包含3个连贯名词的文本中，则只有前两个名词被分块\n",
    "nouns = [('money', 'NN'), ('market', 'NN'), ('fund', 'NN')]\n",
    "grammar = 'NP: {<NN><NN>}'\n",
    "print(\"错误分块的结果= \", nltk.RegexpParser(grammar).parse(nouns))\n",
    "grammar = 'NP: {<NN>+}'\n",
    "print(\"正确分块的结果= \", nltk.RegexpParser(grammar).parse(nouns))"
   ]
  },
  {
   "source": [
    "### 7.2.4 探索文本语料库：从已经标注的语料库中提取匹配特定词性标记序列的短语"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "brown = nltk.corpus.brown\n",
    "count = 0\n",
    "for sent in brown.tagged_sents():\n",
    "    if count < 10:\n",
    "        tree = cp.parse(sent)\n",
    "        for subtree in tree.subtrees():\n",
    "            if subtree.label() == 'CHUNK':\n",
    "                count += 1\n",
    "                print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个搜索函数（一次性返回定义好的数据量）\n",
    "def find_chunks(pattern):\n",
    "    cp = nltk.RegexpParser(pattern)\n",
    "    brown = nltk.corpus.brown\n",
    "    count = 0\n",
    "    for sent in brown.tagged_sents():\n",
    "        if count < 10:\n",
    "            tree = cp.parse(sent)\n",
    "            for subtree in tree.subtrees():\n",
    "                if subtree.label() == 'CHUNK' or subtree.label()=='NOUNS':\n",
    "                    count += 1\n",
    "                    print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "find_chunks(grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)\n(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)\n(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)\n(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)\n(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)\n(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)\n(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)\n(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)\n(NOUNS\n  State/NN-TL\n  Party/NN-TL\n  Chairman/NN-TL\n  James/NP\n  W./NP\n  Dorsey/NP)\n(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'NOUNS: {<N.*>{4,}}'\n",
    "find_chunks(grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个搜索函数（使用生成器）\n",
    "def find_chunks(pattern):\n",
    "    cp = nltk.RegexpParser(pattern)\n",
    "    brown = nltk.corpus.brown\n",
    "    for sent in brown.tagged_sents():\n",
    "        tree = cp.parse(sent)\n",
    "        for subtree in tree.subtrees():\n",
    "            if subtree.label() == 'CHUNK' or subtree.label() == 'NOUNS':\n",
    "                yield subtree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "for i, subtree in enumerate(find_chunks(grammar)):\n",
    "    if i < 10:\n",
    "        print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)\n(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)\n(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)\n(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)\n(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)\n(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)\n(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)\n(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)\n(NOUNS\n  State/NN-TL\n  Party/NN-TL\n  Chairman/NN-TL\n  James/NP\n  W./NP\n  Dorsey/NP)\n(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'NOUNS: {<N.*>{4,}}'\n",
    "for i, subtree in enumerate(find_chunks(grammar)):\n",
    "    if i < 10:\n",
    "        print(subtree)"
   ]
  },
  {
   "source": [
    "### 7.2.5. 添加缝隙：寻找需要排除的成分\n",
    "\n",
    "可以为不包括在大块中的标识符序列定义一个缝隙。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = [(\"the\", \"DT\"),\n",
    "            (\"little\", \"JJ\"),\n",
    "            (\"yellow\", \"JJ\"),\n",
    "            (\"dog\", \"NN\"),\n",
    "            (\"barked\", \"VBD\"),\n",
    "            (\"at\", \"IN\"),\n",
    "            (\"the\", \"DT\"),\n",
    "            (\"cat\", \"NN\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP the/DT little/JJ yellow/JJ dog/NN)\n  barked/VBD\n  at/IN\n  (NP the/DT cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# 先分块，再加缝隙，才能得出正确的结果\n",
    "grammar = r'''\n",
    "    NP: \n",
    "        {<.*>+}         # Chunk everything （先对所有数据分块）\n",
    "        }<VBD|IN>+{     # Chink sequences of VBD and IN（对 VBD 或者 IN 加缝隙）\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP\n    the/DT\n    little/JJ\n    yellow/JJ\n    dog/NN\n    barked/VBD\n    at/IN\n    the/DT\n    cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# 先加缝隙，再分块，就不能得出正确的结果，只会得到一个块，效果与没有使用缝隙是一样的\n",
    "grammar = r'''\n",
    "    NP: \n",
    "        }<VBD|IN>+{     # Chink sequences of VBD and IN\n",
    "        {<.*>+}         # Chunk everything\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP\n    the/DT\n    little/JJ\n    yellow/JJ\n    dog/NN\n    barked/VBD\n    at/IN\n    the/DT\n    cat/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'''\n",
    "    NP: \n",
    "        {<.*>+}         # Chunk everything\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "source": [
    "### 7.2.6 分块的表示：标记与树状图\n",
    "\n",
    "作为「标注」和「分析」之间的中间状态（Ref：Ch8），块结构可以使用标记或者树状图来表示\n",
    "\n",
    "使用最为广泛的表示是IOB标记：\n",
    "\n",
    "-   I（Inside，内部）；\n",
    "-   O（Outside，外部）；\n",
    "-   B（Begin，开始）。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 7.3 开发和评估分块器"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 7.3.1 读取 IOB格式 和 CoNLL2000 语料库"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 不能在text里面加入“空格”和“置表符”用来控制文本的格式\n",
    "text = '''\n",
    "he PRP B-NP\n",
    "accepted VBD B-VP\n",
    "the DT B-NP\n",
    "position NN I-NP\n",
    "of IN B-PP\n",
    "vice NN B-NP\n",
    "chairman NN I-NP\n",
    "of IN B-PP\n",
    "Carlyle NNP B-NP\n",
    "Group NNP I-NP\n",
    ", , O\n",
    "a DT B-NP\n",
    "merchant NN I-NP\n",
    "banking NN I-NP\n",
    "concern NN I-NP\n",
    ". . O\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 绘制块结构的树状图表示\n",
    "nltk.chunk.conllstr2tree(text, chunk_types=('NP',)).draw()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "nltk.chunk.conllstr2tree(text, chunk_types=('NP', 'VP')).draw()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Confidence/NN)\n  in/IN\n  (NP the/DT pound/NN)\n  is/VBZ\n  widely/RB\n  expected/VBN\n  to/TO\n  take/VB\n  (NP another/DT sharp/JJ dive/NN)\n  if/IN\n  (NP trade/NN figures/NNS)\n  for/IN\n  (NP September/NNP)\n  ,/,\n  due/JJ\n  for/IN\n  (NP release/NN)\n  (NP tomorrow/NN)\n  ,/,\n  fail/VB\n  to/TO\n  show/VB\n  (NP a/DT substantial/JJ improvement/NN)\n  from/IN\n  (NP July/NNP and/CC August/NNP)\n  (NP 's/POS near-record/JJ deficits/NNS)\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "# CoNLL2000分块语料库包括3种分块类型：NP、VP、PP\n",
    "from nltk.corpus import conll2000\n",
    "\n",
    "train_sents = conll2000.chunked_sents('train.txt', chunk_types='NP')\n",
    "print(train_sents[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Confidence/NN)\n  in/IN\n  (NP the/DT pound/NN)\n  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)\n  (NP another/DT sharp/JJ dive/NN)\n  if/IN\n  (NP trade/NN figures/NNS)\n  for/IN\n  (NP September/NNP)\n  ,/,\n  due/JJ\n  for/IN\n  (NP release/NN)\n  (NP tomorrow/NN)\n  ,/,\n  (VP fail/VB to/TO show/VB)\n  (NP a/DT substantial/JJ improvement/NN)\n  from/IN\n  (NP July/NNP and/CC August/NNP)\n  (NP 's/POS near-record/JJ deficits/NNS)\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "train_sents = conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP'))\n",
    "print(train_sents[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Confidence/NN)\n  (PP in/IN)\n  (NP the/DT pound/NN)\n  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)\n  (NP another/DT sharp/JJ dive/NN)\n  if/IN\n  (NP trade/NN figures/NNS)\n  (PP for/IN)\n  (NP September/NNP)\n  ,/,\n  due/JJ\n  (PP for/IN)\n  (NP release/NN)\n  (NP tomorrow/NN)\n  ,/,\n  (VP fail/VB to/TO show/VB)\n  (NP a/DT substantial/JJ improvement/NN)\n  (PP from/IN)\n  (NP July/NNP and/CC August/NNP)\n  (NP 's/POS near-record/JJ deficits/NNS)\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "train_sents = conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP', 'PP'))\n",
    "print(train_sents[0])"
   ]
  },
  {
   "source": [
    "### 7.3.2 简单的评估和基准"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rockwell/NNP International/NNP Corp./NNP)\n  (NP 's/POS Tulsa/NNP unit/NN)\n  said/VBD\n  (NP it/PRP)\n  signed/VBD\n  (NP a/DT tentative/JJ agreement/NN)\n  extending/VBG\n  (NP its/PRP$ contract/NN)\n  with/IN\n  (NP Boeing/NNP Co./NNP)\n  to/TO\n  provide/VB\n  (NP structural/JJ parts/NNS)\n  for/IN\n  (NP Boeing/NNP)\n  (NP 's/POS 747/CD jetliners/NNS)\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "# 建立基准\n",
    "test_sents = conll2000.chunked_sents('test.txt', chunk_types='NP')\n",
    "print(test_sents[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  43.4%%\n    Precision:      0.0%%\n    Recall:         0.0%%\n    F-Measure:      0.0%%\n"
     ]
    }
   ],
   "source": [
    "# 没有任何语法规则，即所有的词都被标注为O\n",
    "print(nltk.RegexpParser('').evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n",
      "  (NP Rockwell/NNP International/NNP Corp./NNP)\n",
      "  (NP 's/POS Tulsa/NNP unit/NN)\n",
      "  said/VBD\n",
      "  (NP it/PRP)\n",
      "  signed/VBD\n",
      "  (NP a/DT tentative/JJ agreement/NN)\n",
      "  extending/VBG\n",
      "  (NP its/PRP$ contract/NN)\n",
      "  with/IN\n",
      "  (NP Boeing/NNP Co./NNP)\n",
      "  to/TO\n",
      "  provide/VB\n",
      "  (NP structural/JJ parts/NNS)\n",
      "  for/IN\n",
      "  (NP Boeing/NNP)\n",
      "  (NP 's/POS 747/CD jetliners/NNS)\n",
      "  ./.)\n",
      "ChunkParse score:\n",
      "    IOB Accuracy:  87.7%%\n",
      "    Precision:     70.6%%\n",
      "    Recall:        67.8%%\n",
      "    F-Measure:     69.2%%\n"
     ]
    }
   ],
   "source": [
    "# 正则表达式分块器\n",
    "grammar = r'NP: {<[CDJNP].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(test_sents)[0])\n",
    "print(nltk.RegexpParser(grammar).evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "class UnigramChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents):\n",
    "        train_data = [\n",
    "            [(t, c)\n",
    "             for w, t, c in nltk.chunk.tree2conlltags(sent)]  # 准备训练用的数据\n",
    "            for sent in train_sents]\n",
    "        self.tagger = nltk.UnigramTagger(train_data)  # 使用训练数据训练一元语法标注器\n",
    "\n",
    "    def parse(self, sentence):\n",
    "        pos_tags = [pos for (word, pos) in sentence]\n",
    "        # 需要标注的内容 ['NN','CC','DT','PRP'...]\n",
    "        tagged_pos_tag = self.tagger.tag(pos_tags)\n",
    "        # 标注好的结果 [('NNP','I-NP'),(',','O')...]\n",
    "        chunktags = [\n",
    "            chunktag\n",
    "            for (pos, chunktag) in tagged_pos_tag\n",
    "        ]  # 把标注好的结果选出来\n",
    "        conlltags = [\n",
    "            (word, pos, chunktag)\n",
    "            for ((word, pos), chunktag) in zip(sentence, chunktags)\n",
    "        ]  # 组成最后需要输出的结果\n",
    "        # 最后输出的结果：[('Rockwell', 'NNP', 'I-NP'), ('International', 'NNP', 'I-NP')...]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)  # 将结果转化成树块的方式输出"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  92.9%%\n    Precision:     79.9%%\n    Recall:        86.8%%\n    F-Measure:     83.2%%\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import conll2000\n",
    "\n",
    "test_sents = conll2000.chunked_sents('test.txt', chunk_types='NP')\n",
    "train_sents = conll2000.chunked_sents('train.txt', chunk_types='NP')\n",
    "\n",
    "# 评估unigram标注器的性能\n",
    "unigram_chunker = UnigramChunker(train_sents)\n",
    "print(unigram_chunker.evaluate(test_sents))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[('NN', 'B-NP'), ('IN', 'O'), ('DT', 'B-NP'), ('NN', 'I-NP'), ('VBZ', 'O'), ('RB', 'O'), ('VBN', 'O'), ('TO', 'O'), ('VB', 'O'), ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('IN', 'O'), ('NN', 'B-NP'), ('NNS', 'I-NP'), ('IN', 'O'), ('NNP', 'B-NP'), (',', 'O'), ('JJ', 'O'), ('IN', 'O'), ('NN', 'B-NP'), ('NN', 'B-NP'), (',', 'O'), ('VB', 'O'), ('TO', 'O'), ('VB', 'O'), ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('IN', 'O'), ('NNP', 'B-NP'), ('CC', 'I-NP'), ('NNP', 'I-NP'), ('POS', 'B-NP'), ('JJ', 'I-NP'), ('NNS', 'I-NP'), ('.', 'O')]\n"
     ]
    }
   ],
   "source": [
    "# 训练用的数据格式\n",
    "train_data = [\n",
    "    [(t, c) \n",
    "     for w, t, c in nltk.chunk.tree2conlltags(sent)] \n",
    "    for sent in train_sents]\n",
    "print(train_data[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n"
     ]
    }
   ],
   "source": [
    "# 一元标注器对于标签的标注结果\n",
    "postags = sorted(set(\n",
    "    pos\n",
    "    for sent in train_sents\n",
    "    for (word, pos) in sent.leaves()))\n",
    "print(unigram_chunker.tagger.tag(postags))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 试着自己建立一个二元标注器\n",
    "class BigramChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents):\n",
    "        train_data = [\n",
    "            [(t, c)\n",
    "             for w, t, c in nltk.chunk.tree2conlltags(sent)]\n",
    "            for sent in train_sents]\n",
    "        self.tagger = nltk.BigramTagger(train_data)\n",
    "\n",
    "    def parse(self, sentence):\n",
    "        pos_tags = [\n",
    "            pos\n",
    "            for (word, pos) in sentence]\n",
    "        tagged_pos_tag = self.tagger.tag(pos_tags)\n",
    "        chunktags = [\n",
    "            chunktag\n",
    "            for (pos, chunktag) in tagged_pos_tag]\n",
    "        conlltags = [\n",
    "            (word, pos, chunktag)\n",
    "            for ((word, pos), chunktag) in zip(sentence, chunktags)]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  93.3%%\n    Precision:     82.3%%\n    Recall:        86.8%%\n    F-Measure:     84.5%%\n"
     ]
    }
   ],
   "source": [
    "# 二元标注器对性能的提高非常有限\n",
    "bigram_chunker = BigramChunker(train_sents)\n",
    "print(bigram_chunker.evaluate(test_sents))"
   ]
  },
  {
   "source": [
    "### 7.3.3 训练基于分类器的分块器\n",
    "想要最大限度地提升分块的性能，需要使用词的内容信息作为词性标记的补充。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ex-7.5. 使用连续分类器（最大熵分类器）对名词短语分块（i5-5200U，执行时间20分钟）\n",
    "# 不能使用megam算法，megam表示LM-BFGS algorithm，需要使用External Libraries，\n",
    "# Windows用户就不要尝试了，因为作者根本没有提供Windows的安装版本\n",
    "# 取消algorithm='megam'设置，使用默认的算法就可以了-->The default algorithm = 'IIS'(Improved Iterative Scaling )\n",
    "# ConsecutiveNPChunkTagger与Ex6-5中的ConsecutivePosTagger类相同，区别只有特征提取器不同。\n",
    "class ConsecutiveNPChunkTagger(nltk.TaggerI):\n",
    "    def __init__(self, train_sents):\n",
    "        train_set = []\n",
    "        for tagged_sent in train_sents:\n",
    "            untagged_sent = nltk.tag.untag(tagged_sent)\n",
    "            history = []\n",
    "            for i, (word, tag) in enumerate(tagged_sent):\n",
    "                # print(untagged_sent, i, history)\n",
    "                featureset = npchunk_features(untagged_sent, i, history)\n",
    "                train_set.append((featureset, tag))\n",
    "                history.append(tag)\n",
    "        # self.classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)\n",
    "        self.classifier = nltk.MaxentClassifier.train(train_set, trace=0)\n",
    "\n",
    "    def tag(self, sentence):\n",
    "        history = []\n",
    "        for i, word in enumerate(sentence):\n",
    "            featureset = npchunk_features(sentence, i, history)\n",
    "            tag = self.classifier.classify(featureset)\n",
    "            history.append(tag)\n",
    "        return zip(sentence, history)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 对ConsecutiveNPChunkTagger的包装类，使之变成一个分块器\n",
    "class ConsecutiveNPChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents):\n",
    "        tagged_sents = [\n",
    "            [((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]\n",
    "            for sent in train_sents]\n",
    "        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n",
    "\n",
    "    def parse(self, sentence):\n",
    "        tagged_sents = self.tagger.tag(sentence)\n",
    "        conlltags = [\n",
    "            (w, t, c,)\n",
    "            for ((w, t), c) in tagged_sents]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1） 第一个特征提取器\n",
    "#       最为简单，只使用了单词本身的标签作为特征，训练结果与unigram分类器非常相似\n",
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    return {'pos': pos}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  92.9%%\n    Precision:     79.9%%\n    Recall:        86.8%%\n    F-Measure:     83.2%%\n"
     ]
    }
   ],
   "source": [
    "# 验证基于分类器的分块器的性能（运行时间较长）\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[(('Confidence', 'NN'), 'B-NP'), (('in', 'IN'), 'O'), (('the', 'DT'), 'B-NP'), (('pound', 'NN'), 'I-NP'), (('is', 'VBZ'), 'O'), (('widely', 'RB'), 'O'), (('expected', 'VBN'), 'O'), (('to', 'TO'), 'O'), (('take', 'VB'), 'O'), (('another', 'DT'), 'B-NP'), (('sharp', 'JJ'), 'I-NP'), (('dive', 'NN'), 'I-NP'), (('if', 'IN'), 'O'), (('trade', 'NN'), 'B-NP'), (('figures', 'NNS'), 'I-NP'), (('for', 'IN'), 'O'), (('September', 'NNP'), 'B-NP'), ((',', ','), 'O'), (('due', 'JJ'), 'O'), (('for', 'IN'), 'O'), (('release', 'NN'), 'B-NP'), (('tomorrow', 'NN'), 'B-NP'), ((',', ','), 'O'), (('fail', 'VB'), 'O'), (('to', 'TO'), 'O'), (('show', 'VB'), 'O'), (('a', 'DT'), 'B-NP'), (('substantial', 'JJ'), 'I-NP'), (('improvement', 'NN'), 'I-NP'), (('from', 'IN'), 'O'), (('July', 'NNP'), 'B-NP'), (('and', 'CC'), 'I-NP'), (('August', 'NNP'), 'I-NP'), ((\"'s\", 'POS'), 'B-NP'), (('near-record', 'JJ'), 'I-NP'), (('deficits', 'NNS'), 'I-NP'), (('.', '.'), 'O')]\n"
     ]
    }
   ],
   "source": [
    "# 最初的是[（（单词，标签），分块）,...]\n",
    "chunked_sents = [\n",
    "    [((w, t), c)\n",
    "     for (w, t, c) in nltk.chunk.tree2conlltags(sent)]\n",
    "    for sent in train_sents[0:1]]\n",
    "print(chunked_sents[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), (\"'s\", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]\n"
     ]
    }
   ],
   "source": [
    "# 脱第一层“分块”得到[（单词，标签）,...]\n",
    "tagged_sent = nltk.tag.untag(chunked_sents[0])\n",
    "print(tagged_sent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', \"'s\", 'near-record', 'deficits', '.']\n"
     ]
    }
   ],
   "source": [
    "# 再脱一层“标签”得到[单词,...]\n",
    "untagged_sent = nltk.tag.untag(tagged_sent)\n",
    "print(untagged_sent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 再脱一层“标签”就会报错\n",
    "# nltk.tag.untag(untagged_sent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "0.   ('Confidence', 'NN') B-NP\t --> \t{'pos': 'NN'}\n1.   ('in', 'IN') O\t --> \t{'pos': 'IN'}\n2.   ('the', 'DT') B-NP\t --> \t{'pos': 'DT'}\n3.   ('pound', 'NN') I-NP\t --> \t{'pos': 'NN'}\n4.   ('is', 'VBZ') O\t --> \t{'pos': 'VBZ'}\n5.   ('widely', 'RB') O\t --> \t{'pos': 'RB'}\n6.   ('expected', 'VBN') O\t --> \t{'pos': 'VBN'}\n7.   ('to', 'TO') O\t --> \t{'pos': 'TO'}\n8.   ('take', 'VB') O\t --> \t{'pos': 'VB'}\n9.   ('another', 'DT') B-NP\t --> \t{'pos': 'DT'}\n10.   ('sharp', 'JJ') I-NP\t --> \t{'pos': 'JJ'}\n11.   ('dive', 'NN') I-NP\t --> \t{'pos': 'NN'}\n12.   ('if', 'IN') O\t --> \t{'pos': 'IN'}\n13.   ('trade', 'NN') B-NP\t --> \t{'pos': 'NN'}\n14.   ('figures', 'NNS') I-NP\t --> \t{'pos': 'NNS'}\n15.   ('for', 'IN') O\t --> \t{'pos': 'IN'}\n16.   ('September', 'NNP') B-NP\t --> \t{'pos': 'NNP'}\n17.   (',', ',') O\t --> \t{'pos': ','}\n18.   ('due', 'JJ') O\t --> \t{'pos': 'JJ'}\n19.   ('for', 'IN') O\t --> \t{'pos': 'IN'}\n20.   ('release', 'NN') B-NP\t --> \t{'pos': 'NN'}\n21.   ('tomorrow', 'NN') B-NP\t --> \t{'pos': 'NN'}\n22.   (',', ',') O\t --> \t{'pos': ','}\n23.   ('fail', 'VB') O\t --> \t{'pos': 'VB'}\n24.   ('to', 'TO') O\t --> \t{'pos': 'TO'}\n25.   ('show', 'VB') O\t --> \t{'pos': 'VB'}\n26.   ('a', 'DT') B-NP\t --> \t{'pos': 'DT'}\n27.   ('substantial', 'JJ') I-NP\t --> \t{'pos': 'JJ'}\n28.   ('improvement', 'NN') I-NP\t --> \t{'pos': 'NN'}\n29.   ('from', 'IN') O\t --> \t{'pos': 'IN'}\n30.   ('July', 'NNP') B-NP\t --> \t{'pos': 'NNP'}\n31.   ('and', 'CC') I-NP\t --> \t{'pos': 'CC'}\n32.   ('August', 'NNP') I-NP\t --> \t{'pos': 'NNP'}\n33.   (\"'s\", 'POS') B-NP\t --> \t{'pos': 'POS'}\n34.   ('near-record', 'JJ') I-NP\t --> \t{'pos': 'JJ'}\n35.   ('deficits', 'NNS') I-NP\t --> \t{'pos': 'NNS'}\n36.   ('.', '.') O\t --> \t{'pos': '.'}\n"
     ]
    }
   ],
   "source": [
    "history=[]\n",
    "for i, (word, tag) in enumerate(chunked_sents[0]):\n",
    "    print(str(i)+'.  ', word, tag, end='\\t --> \\t')\n",
    "    feature_set = npchunk_features(tagged_sent, i, history)\n",
    "    print(feature_set)\n",
    "    history.append(tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  93.6%%\n    Precision:     82.0%%\n    Recall:        87.2%%\n    F-Measure:     84.6%%\n"
     ]
    }
   ],
   "source": [
    "# 2） 第二个特征提取器\n",
    "#       使用了单词前面一个单词的标签作为特征，效果类似于bigram分块器\n",
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = '<START>', '<START>'\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i - 1]\n",
    "    return {'pos': pos, 'prevpos': prevpos}\n",
    "\n",
    "\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  94.6%%\n    Precision:     84.6%%\n    Recall:        89.8%%\n    F-Measure:     87.1%%\n"
     ]
    }
   ],
   "source": [
    "# 3） 第三个特征提取器，使用了单词本身的标签、前一个单词、前一个单词的标签作为特征\n",
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = '<START>', '<START>'\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i - 1]\n",
    "    return {'pos': pos, 'word': word, 'prevpos': prevpos}\n",
    "\n",
    "\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "ChunkParse score:\n    IOB Accuracy:  96.0%%\n    Precision:     88.3%%\n    Recall:        91.1%%\n    F-Measure:     89.7%%\n"
     ]
    }
   ],
   "source": [
    "# 4) 第四个特征提取器，使用了多种附加特征\n",
    "#   * 预取特征\n",
    "#   * 配对功能\n",
    "#   * 复杂的语境特征\n",
    "#   * tags-since-dt：用其创建一个字符串，描述自最近限定词以来遇到的所有词性标记\n",
    "def tags_since_dt(sentence, i):\n",
    "    tags = set()\n",
    "    for word, pos in sentence[:i]:\n",
    "        if pos == 'DT':\n",
    "            tags = set()\n",
    "        else:\n",
    "            tags.add(pos)\n",
    "    return '+'.join(sorted(tags))\n",
    "\n",
    "\n",
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = '<START>', '<START>'\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i - 1]\n",
    "    if i == len(sentence) - 1:\n",
    "        nextword, nextpos = '<END>', '<END>'\n",
    "    else:\n",
    "        nextword, nextpos = sentence[i + 1]\n",
    "    return {\n",
    "        'pos': pos,\n",
    "        'word': word,\n",
    "        'prevpos': prevpos,\n",
    "        'nextpos': nextpos,\n",
    "        'prevpos+pos': '%s+%s' % (prevpos, pos),\n",
    "        'pos+nextpos': '%s+%s' % (pos, nextpos),\n",
    "        'tags-sincce-dt': tags_since_dt(sentence, i)\n",
    "    }\n",
    "\n",
    "\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  }
 ]
}